Adapting magnetoresistive memory devices for accurate and on-chip-training-free in-memory computing

Memristors have emerged as promising devices for enabling efficient multiply-accumulate (MAC) operations in crossbar arrays, crucial for analog in-memory computing (AiMC). However, variations in memristors and associated circuits can affect the accuracy of analog computing. Typically, this is mitigated by on-chip training, which is challenging for memristors with limited endurance. We present a hardware-software codesign for magnetic tunnel junction (MTJ)-based AiMC with off-chip calibration that achieves software-level accuracy without costly on-chip training. Hardware-wise, MTJ devices exhibit ultralow cycle-to-cycle variations, as experimentally evaluated over 1 million mass-produced devices. Software-wise, we leverage this property to propose an off-chip training method that adjusts deep neural network parameters for accurate AiMC inference. We validate this approach with MAC operations, showing improved transfer curve linearity and reduced errors. By emulating large-scale neural network models, our codesigned MTJ-based AiMC closely matches software baseline accuracy and outperforms existing off-chip training methods, highlighting the potential of MTJs in AI tasks.

Another concern in designing bitcells for AiMC systems is device variation in the memory device. The different conductance shifts caused by device variation and circuit noise at each cell in the crossbar array reduce the read margin between states. Differential cells are more resilient to noise and interference than binary cells because they store information in the difference between two devices rather than in a single absolute value. This is particularly beneficial for memory devices with a low on-off ratio, such as MRAM, where a differential cell can double the dynamic range (51). This improved robustness strengthens error resilience for multiply-accumulate (MAC) operations in AiMC systems (18).

Supplementary Note 2: IR Drop in the Crossbar Array

The computation of a crossbar array relies on the conductance of the memory devices. Although the wire resistance of the interconnects is typically minuscule, approximately 1 ohm per cell, it can cumulatively have a significant impact on in-memory computing accuracy (44, 52) when the AiMC system is implemented with large arrays (Fig. S1A). The output of the proposed differential cell is obtained by calculating the difference between the positive source line ($SL^+$) and the negative source line ($SL^-$). Ideally, the current contributed to the source line by one cell is $I_{SL} = V_{BL} \cdot G_{cell}$. However, due to the presence of wire resistance, the effective conductance of the target cell is degraded. Consequently, the output current of the differential cell can be expressed as the equation in Fig. S1B.
The wire resistance decreases the effective conductance of the cell and thus its output yield, with a strong dependence on the cell's position in the array (Fig. S1B). This poses a challenge for scaling up AiMC systems, since the accuracy of MAC operations and AI tasks can be severely degraded by the increased wire resistance in large arrays. Appropriate error detection and compensation mechanisms are necessary to restore data integrity and computation accuracy.
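To make the position dependence concrete, the following minimal Python sketch models the degraded effective conductance under a simple lumped wire-resistance assumption; the conductance value and the wire model are illustrative, not the exact model of Fig. S1B.

```python
import numpy as np

# Minimal sketch of position-dependent conductance degradation from IR drop.
# The wire model is an illustrative assumption (not the paper's exact model):
# each cell adds ~1 ohm of interconnect, and the series wire resistance seen
# by a cell grows with its distance from the bit-line driver and the TIA.
ROWS, COLS = 256, 256
R_CELL_WIRE = 1.0    # ohms of wire per cell (typical value cited in the text)
G_TARGET = 1 / 5e3   # hypothetical target conductance (5 kohm MTJ state)

rows, cols = np.meshgrid(np.arange(ROWS), np.arange(COLS), indexing="ij")
r_wire = (rows + cols) * R_CELL_WIRE      # series wire resistance per cell
g_eff = 1.0 / (1.0 / G_TARGET + r_wire)   # degraded effective conductance

# Relative "yield" of each cell compared with its ideal conductance.
yield_map = g_eff / G_TARGET
print(f"worst-case cell retains {yield_map.min():.1%} of its ideal conductance")
```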
For some memristor devices, the conductance can be tuned in an analog fashion (22, 44), allowing the wire resistance in the array to be modeled and compensated for by adjusting the target conductance of the memristor. However, the MTJ has only two conductance levels, making direct adjustment of the device conductance impossible. Instead, the parameters programmed into the device can be adjusted to fit the degraded conductance of the target device. The proposed scheme adaptively quantizes the parameters according to the device-specific conductance shift look-up table (DSCS-LUT). Each differential cell is accessed and measured during the conductance shift sensing process to obtain the device-to-device (DtD) variation. Because the sensed conductance shift inherently includes the wire resistance, the DSCS-LUT also contains the information needed to compensate for the IR drop through adaptive quantization.

Supplementary Note 3: Model Capacity Improvement Using Adaptive Quantization

Quantized activations and weights in neural networks often limit model performance because the optimization space is reduced compared with models represented by floating-point numbers. In other words, the capacity of quantized models is inherently smaller than that of full-precision models. Recent research has introduced additional parameters that dynamically adjust the step and range of quantized models to address this issue, thereby enlarging the optimization space and improving performance (53-55). In line with this trend in the computer science community, and considering the unique variation characteristics of MRAM, we propose an adaptive quantization scheme that leverages the inherent device variations of the MTJs to quantize parameters for efficient and accurate AiMC inference. Unlike dynamic quantization on digital computers, which requires extra floating-point parameters to adjust the range and step of the parameters, the adaptive quantization scheme dynamically adjusts the parameters using the inherently non-uniform steps and ranges derived from the known conductance shifts of the AiMC chip's devices. Consequently, no additional floating-point parameter is needed to enlarge the quantized model capacity, reducing the extra storage and computing cost of implementing such a scheme.
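As an illustration of this idea, the sketch below contrasts conventional digital quantization with adaptive quantization on a hypothetical DSCS-LUT; the shift magnitudes, array size, and function names are illustrative assumptions, not the on-chip procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DSCS-LUT: for each differential cell, the measured weight
# realized by the three programmable states (P/AP, AP/AP, AP/P), i.e. the
# ideal values -1, 0, +1 plus a per-device conductance shift.
N_CELLS = 8
IDEAL = np.array([-1.0, 0.0, 1.0])
dscs_lut = IDEAL + rng.normal(0.0, 0.1, size=(N_CELLS, 3))

def digital_quantize(w, cell):
    """Conventional quantization: choose the state by its *ideal* value;
    the hardware nevertheless realizes the shifted, measured value."""
    idx = int(np.argmin(np.abs(IDEAL - w)))
    return dscs_lut[cell, idx]

def adaptive_quantize(w, cell):
    """Adaptive quantization: choose the state whose *measured* value in
    the DSCS-LUT best matches the weight to be programmed."""
    idx = int(np.argmin(np.abs(dscs_lut[cell] - w)))
    return dscs_lut[cell, idx]

weights = rng.uniform(-1.2, 1.2, size=N_CELLS)
err_dig = np.abs([digital_quantize(w, c) - w for c, w in enumerate(weights)])
err_ada = np.abs([adaptive_quantize(w, c) - w for c, w in enumerate(weights)])
print("mean |error| digital :", err_dig.mean())
print("mean |error| adaptive:", err_ada.mean())
```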
To demonstrate the advantage of utilizing the intrinsic device variation of MTJs in AiMC, we designed a linear regression experiment with a simple target function, $f(x) = kx$. A single-layer perceptron model without a bias term is used in the regression task, as the computation of the synapses can be directly compared to the MAC operation in a single column of the AiMC crossbar array. The difference between the sum of the weights and the target slope $k$ is the error of the model. The digital quantized model's weights are represented by ternary values (-1, 0, +1), resulting in a best possible regression value of 3 (Fig. S2A). However, when a model with ternary weights is deployed to the AiMC macro, the conductance of each cell does not precisely equal the target conductance due to device variation. This variation gives the model a larger optimization space for better regression. As shown in Fig. S2A, although the weight states are ternary for both digital (-1, 0, +1) and adaptive quantization (P/AP, AP/AP, AP/P), adaptive quantization in the AiMC has the chance to achieve superior results. In this case, an increase in device variation does not necessarily degrade the AiMC's performance, as larger conductance shifts can be combined to achieve better outcomes. We implemented the adaptive quantization scheme on five device batches (A-E) with varying target MTJ critical dimensions (CDs) and hence different device variations (Fig. S2B). The experimental results (Fig. S2C) show that the fitting error increases with the variation when the number of neurons is small, because the randomness of the variation can trap the optimization space in a poor condition. To reduce the chance of the parameter space being trapped, a sufficient number of synapses (devices) is needed; thus, the regression error decreases as the number of neurons increases. In other words, to ensure the model is not easily trapped in a poor condition, the model should be enlarged so that enough parameters are available. This parameter-number requirement raises a concern about the feasibility of deploying such adaptive methods to smaller models with fewer parameters. Generally, the loss landscape of finite-parameter models is believed to be non-convex with multiple local minima (56), which indicates that deep learning models are unlikely to be trapped by the randomly generated parameter exploration space in the way the simple linear regression case is. The experiments in the main text, which use models of different sizes and tasks of varying complexity, demonstrate the generalization of adaptive quantization.
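The sketch below reproduces the spirit of this experiment; the non-integer target slope (k = 3.3) and the greedy state-selection heuristic are stand-in assumptions, not the actual training procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3.3       # hypothetical non-integer target slope for f(x) = k*x
SIGMA = 0.1   # illustrative device-variation scale

def digital_fit(k, n_neurons):
    """Ideal ternary weights can only reach integer sums in [-n, n]."""
    return abs(k - np.clip(round(k), -n_neurons, n_neurons))

def adaptive_fit(k, n_neurons, sigma):
    """Greedy stand-in for training: each synapse picks the measured
    ternary state that moves the running weight sum closest to k."""
    shifted = np.array([-1.0, 0.0, 1.0]) + rng.normal(0, sigma, (n_neurons, 3))
    total = 0.0
    for cell_states in shifted:
        total += cell_states[np.argmin(np.abs(total + cell_states - k))]
    return abs(total - k)

for n in (4, 16, 64, 256):
    err = np.mean([adaptive_fit(K, n, SIGMA) for _ in range(100)])
    print(f"n={n:4d}  digital err={digital_fit(K, n):.3f}  adaptive err={err:.4f}")
```

Consistent with Fig. S2C, the digital error is fixed by the nearest integer sum, whereas the adaptive error shrinks as more synapses are available to combine shifted states.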
Supplementary Note 4: Adaptive Quantization Implementation Cost

a. Adaptive Quantization Precision

In-memory computing systems employing ternary cells can execute multi-bit computations by calculating individual bits and combining partial summations through shift-and-add operations. The total weight level of the conventional quantization method for a ternary bit cell is given by $2^{N_w+1} - 1$. The proposed adaptive quantization scheme uses conductance shift sensing to measure deviations of the real conductance from the target value. As a result, the output of the analog-to-digital converter (ADC) should be symmetric and centered at the target value, with a total level of $2^{N_{ADC}} - 1$. The conductance shift sensing includes the offset value of a differential cell, which reduces the minimal resolution of the weight to $1/(2^{N_{ADC}} - 2)$ LSB. Adaptive quantization precision is determined jointly by the ADC precision and the weight bit number. The total level of quantized weights, $N_{total}$, can be expressed by equation (1), where $N_w$ represents the bit number of weights and $N_{ADC}$ is the bit precision of the conductance shift sense ADC, which must be greater than 1 bit. $N_{total}$ rises with increased precision of the weights and of the conductance shift sensing, as illustrated in Fig. S3A. Increasing $N_w$ necessitates more devices to store additional bits and more cycles to calculate the partial sum. Increasing the ADC precision can also increase the number of equivalent weight levels, at the expense of a larger conductance shift sensing circuit area. As shown in Fig. S3B, the baseline is a 1-bit weight without conductance shift sensing (without adaptive quantization), i.e., 1×. Solely increasing the weight bit number to 5 bits raises the state number by only 21×. In contrast, increasing the conductance shift sense ADC to 5 bits raises the state number by 60.3×. In our design, we use a 2-bit weight with a 3-bit sense ADC, achieving a 36.3× improvement in weight state number over the 1-bit baseline.
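The individual quantities defined above can be checked numerically. The snippet below evaluates only the expressions stated in the text, i.e., the conventional level count and the sensed resolution, together with the quoted 21× ratio for a 5-bit weight; it does not reproduce the combined equation (1).

```python
# Numerical check of the precision quantities defined above.
def conventional_levels(n_w):
    """Total weight levels of conventional ternary quantization: 2^(n_w+1) - 1."""
    return 2 ** (n_w + 1) - 1

def shift_resolution_lsb(n_adc):
    """Minimal weight resolution with shift sensing: 1/(2^n_adc - 2) LSB."""
    return 1.0 / (2 ** n_adc - 2)

print(conventional_levels(1))                            # ternary baseline: 3
print(conventional_levels(5) // conventional_levels(1))  # 21x from weight bits alone
print(shift_resolution_lsb(3))                           # 1/6 LSB with the 3-bit sense ADC
```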
Recent research has shown that with high-precision sensing techniques, a parameter can be mapped into memristor arrays with arbitrarily high precision for analog computing (34). We also evaluated the optimal precision of our conductance shift sensing process to balance the trade-off between accuracy and cost. As illustrated in Fig. S3C, the most significant improvement in weight mapping accuracy occurs when the conductance shift sensing precision increases from 2 bits to 3 bits. Further increasing the precision does not yield substantial improvements in weight mapping accuracy, while the cost of implementing high-precision sensing circuits escalates exponentially. In this work, we directly reuse the compute ADC as the conductance shift ADC and incorporate additional bias circuits to configure the sensing circuits for the different scenarios, incurring negligible extra chip area.

b. Cost of Conductance Shift Sensing
As mentioned previously, the extra hardware required to support conductance shift sensing is negligible. The main costs of this process are the energy consumption and the memory required to store the DSCS-LUT. Suppose the array size is 256×512 (256×256 differential cells). The conductance shift sensing needs to sense the cells under the [-1, 0, +1] states, which requires 256×256×3 = 196,608 write and read operations, consuming 3.4 mJ of energy (200 ns write and read pulses). The storage consumption of the DSCS-LUT using a 3-bit ADC is 256×256×3×3 = 589,824 bits, about 73.7 kB.
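These costs follow directly from the array geometry; the short snippet below reproduces the arithmetic.

```python
# Cost arithmetic for the 256x512 array (256x256 differential cells).
ROWS, COLS_DIFF, STATES, ADC_BITS = 256, 256, 3, 3

ops = ROWS * COLS_DIFF * STATES       # sensing under the [-1, 0, +1] states
lut_bits = ops * ADC_BITS             # 3-bit shift code per sensed state
print(ops)                            # 196,608 write/read operations (3.4 mJ total)
print(lut_bits, lut_bits / 8 / 1000)  # 589,824 bits ~= 73.7 kB
```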
Since adaptive quantization is executed only once, when a model is first compiled for deployment to the chip, model inference does not rely on the DSCS-LUT. Thus, there are two options for performing adaptive quantization, depending on the constraints of the edge device:

1. Storage bounded: The DSCS-LUT is deleted after the adaptive quantization of the model is done. In this case, conductance shift sensing is needed to obtain a new DSCS-LUT each time a new model is deployed.

2. Energy bounded: The DSCS-LUT is stored in the main storage of the edge device and loaded each time a new model is deployed.

Supplementary Note 5: Data Dependency of the Macro
In DNN inference, the combination of inputs and weights is highly random. DtD variation and the intrinsic data dependency of the crossbar array and output circuit design necessitate an exploration of the variation distribution in relation to data dependency. Assuming the conductance of a device follows a normal distribution $G \sim \mathcal{N}(\mu, \sigma^2)$, the output current of a single column can be expressed as $I_{col} = \sum_i V_i G_i$. When the input $V_i$ of a specific cell is zero, no current flows through the device. Consequently, the distribution of the output is related to the number of activated rows $n$ and can be written as $I_{col} \sim \mathcal{N}\left(\sum_{i=1}^{n} V_i \mu_i, \; n\sigma^2\right)$. To analyze the weight dependency, we first assume that all rows in the column are activated, resulting in a constant output variance equal to $256\sigma^2$. However, this prediction relies on a sufficient number of samples, such as those from the entire crossbar array. For a single column, the number of samples correlates with the MAC value, and there may not be enough samples to support the statistical distribution. For example, if the expected MAC value is 256, which equals the number of rows in the array, the output current becomes deterministic, as all cells must be set to 1. The number of samples is largest when the MAC value is 0 and decreases as its absolute value increases.
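The row-count dependence of these statistics can be verified with a simple Monte Carlo sketch; a unit read voltage and illustrative values of μ and σ are assumed here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo check of the column statistics derived above: with n activated
# rows and per-device conductance G ~ N(mu, sigma^2), the column current
# should follow N(sum_i V_i * mu, n * sigma^2).
MU, SIGMA, TRIALS = 1.0, 0.05, 20_000
for n in (16, 64, 256):
    g = rng.normal(MU, SIGMA, size=(TRIALS, n))
    i_col = g.sum(axis=1)          # V_i = 1 on every activated row
    print(f"n={n:3d}  mean={i_col.mean():7.2f} (ideal {n*MU:.0f})  "
          f"var={i_col.var():.4f} (ideal {n*SIGMA**2:.4f})")
```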
For the input test (Fig. S4), the entire array was set to 1 (P/AP), and a read voltage (0.1 V) was applied to the activated BLs. Due to the variation in the array, even with a fixed number of activated rows, the number of possible combinations of activated rows can be extensive ($\binom{256}{n}$ for $n$ activated rows in an array with 256 rows). We randomly selected 500 combinations for each MAC value and repeated the tests on 200 columns.
The weight data dependency was tested by activating all rows in the array and sweeping the weight value from -256 to 256. As in the input data test, a single weight value (expected current value) corresponds to a vast number of combinations. Due to DtD variation, the same combination for one weight-sum value can yield significantly different output currents in different columns. This phenomenon results in different distributions of real versus ideal current when tested on one or on multiple columns. Consequently, we plotted two figures to illustrate these distributions: one obtained from 1e5 combinations for each MAC value over 200 different columns (Fig. S5A) and another from a single column (Fig. S6).

Fig. S7A illustrates the optimization breakdown of adaptive quantization for the MAC operation. As the CtC variation increases, the proportion of errors that adaptive quantization can correct decreases, dropping from 85% at 0% CtC variation to 28% at 100% CtC variation. Fig. S7B presents the accuracy curve of each method tested on the MNIST dataset. At low CtC variation levels, both in-situ training and adaptive quantization achieve flawless performance. However, as CtC variation increases, optimization grounded in conductance shift awareness becomes less reliable, and the performance of in-situ training and adaptive quantization declines. Ultimately, if the CtC variation is exceedingly large, the optimization of these two methods becomes entirely dysfunctional, and their performance approaches that of the VAT method, which relies solely on the device's statistical model and the model's parameter redundancy. As depicted in Fig. S7C, in-situ training, adaptive quantization, and VAT exhibit similar performance when the CtC variation exceeds 75%.
Adaptive quantization is designed to capture the spatially random yet temporally fixed variation of devices; hence, it favors devices with low CtC variation. Nevertheless, our emulation indicates that adaptive quantization can still deliver substantial optimization for devices that suffer from significant CtC variation.

Fig. S1. The IR drop caused by the interconnect wire resistance in the crossbar array. (A) The output of the differential cell is obtained by sensing the differential voltage converted from the source line current by the TIA; the conductance of each MTJ in the differential cell is also configured differentially. The wire resistance is contributed mainly by the bit line and the source line and is proportional to the wire length. The IR drop can reduce the yield of the differential cell. (B) The output distorted by the wire resistance can be modeled as a function of the cell's position in the array, and the yield can be plotted. *The conductance is normalized. **In this case, the conductance of each device is measured by external circuits and the sensed shift is down-sampled to 3 bits. For the actual implementation, the sensing is done by on-chip circuits, and this table will only contain the state and the calibrated conductance.

Fig. S2. Linear regression experiment. (A) Illustration of how digital quantization and adaptive quantization work. In digital quantization, weights take only the quantized values represented by the device states. In adaptive quantization, the quantized value is adjusted according to the conductance shift of each particular differential cell. The adaptively quantized model can potentially provide better regression results than conventional digital quantization. (B) The conductance variation measured on five batches of MRAM devices fabricated with different eCDs. The fitting curve shows that the scaling variation increases as the eCD decreases. (C) Linear regression error measured on the five batches of devices. For adaptive quantization, simply increasing the number of neurons/devices improves the performance of the model, which is not possible for digital quantization models. The results also show that the larger variation caused by device scaling does not decrease the model performance when an adequate number of neurons is used.

Fig. S3. How the weight bit number and the conductance shift sensing ADC precision influence the total weight levels. (A) Visualization of the weight distribution change, and (B) the total weight levels as the weight bit and ADC bit precision change. (C) Increasing the ADC precision also reduces the average weight mapping error and thus jointly increases the AI inference accuracy. Due to the CtC variation overhead, the effectiveness of increasing the ADC precision diminishes; thus, the optimal trade-off for the ADC precision is 3 bits.

Fig. S5. MAC error analysis of the weight dependency, tested on 200 columns with 500 combinations for each MAC value. (A) The transfer function shows that, when tested on multiple columns, the variation does not depend on the weight value. (B) The error bit counts, in LSB, of the digital output code from the ADC before and after the DVAQ calibration, and (C) the absolute error before and after DVAQ. (D) For the cases where an error occurs, the pie chart shows the percentage of MAC results that are better than, equal to, or worse than the original output.

Fig. S6. MAC error analysis of the weight dependency, tested on a single column with 1e5 combinations for each MAC value.

Fig. S8. Resistance bit map of multiple 32×32 subregions tested at the low-resistance state.

Fig. S10. Comparison between experimental and simulation results on MNIST using TCD devices. (A) The weight mapping error of each codesign scheme in experiments and simulations. The mean error values of the simulations are equal to those of the experiments. (B) The output feature map error of the first convolution layer in LeNet. The simulated results have a similar mean error to the experiment.

Table S1. Variation in different emerging non-volatile memories.
*Not enough tested devices to draw a conclusion. †Devices are programmed to the target conductance and then measured.